Q-Herilearn: Assessing heritage learning in digital environments. A mixed approach with factor and IRT models

The assessment of heritage learning in digital environments lacks instruments that measure it with sufficient guarantees of accuracy, validity, and reliability. This study attempts to fill this gap by developing an instrument that has shown solid metric qualities. This article presents the design and calibration of a scale applied to 1,454 participants between 19 and 63 years of age. Factor analysis (Exploratory Structural Equation Modeling, ESEM) and Item Response Theory models (Graded Response Model, GRM) were used. Sufficient evidence of both reliability and validity based on content and internal structure was obtained. Invariance of scores as a function of gender and age of participants has also been demonstrated. The discrimination parameters of the items have been found to be high, and the test information curves have shown that the subscales measure wide ranges of the respective latent variables with sufficient precision. The instrument offers wide possibilities of application to various areas of Heritage Education (HE), e.g., design of HE programs, definition and planning of teaching objectives, and evaluation of programs in virtual environments.


Introduction
In the last decade, digital environments have positioned themselves as burgeoning educational settings for teaching cultural heritage, not only due to their widespread use, but also because of the potential they represent for learning in the sphere of heritage education [1]. Digital media are frequently presented as extensions or complements of real physical environments; for this reason, heritage learning outcomes obtained in digital environments are measured in research work in close connection with the geographical context (e.g., [2][3][4][5]). Some studies, however, go beyond such spatial references and instead focus on digital environments as specific (informal) heritage learning settings, so that they are understood as stand-alone informal learning environments [6,7].
The evaluation of learning in heritage education has been dispersed in terms of the targets of measurement, which cover the upgrading of acquired knowledge [8], the development of competencies [9], sensory-motor learning [10], the learning experience, the enjoyment derived from the latter [11], the attitudes towards heritage [12], and even social learning outcomes [13]. When they rely on previous designs and interventions, studies usually measure the specific effects derived from their implementation [14,15]. In the particular case of evaluation of technology-mediated learning, the studies deal with the impact of mobile learning on heritage knowledge [16,17] or the effectiveness of certain technological resources for achieving heritage learning goals [18,19], including analyzing the quality of learning from a psychoneurological perspective [20].
Among the studies specifically dedicated to the evaluation of heritage learning in digital environments, some are concerned with gauging the effects of intrinsic motivation and competence obtained by means of virtual-reality-based learning, which are compared with traditional text-based learning [21]. It has also become possible to evaluate the potential of portals and other synchronous learning platforms to promote empathy among diverse cultural populations, considering that standard heritage spaces (for example, museums) should adopt synchronous learning to develop a more participatory and dynamic educational model [22]. Along this line, which seeks to combine face-to-face and virtual experience, the cognitive, emotional and social dimensions involved in the learning process have likewise become the object of analysis [23], and so have the processes linked to the transmission of heritage values on social media [24].
Although there are many studies related to heritage learning in digital environments, almost all of them focus on the implementation of innovation and are exploratory in nature, with the limitations that this entails in terms of generalizing results. With the exception of the Instructional Materials Motivation Survey Questionnaire [25], which evaluates attention, confidence and satisfaction factors, and the intuitive evaluation system designed by Lee et al. (2016) [26], which attempts to measure the affective, cognitive and operational dimensions in learning processes, there are no specific studies on any instrument that measures learning outcomes in digital heritage education environments. In the studies collected, ad hoc questionnaires have mostly been used for the specific designs under scrutiny [8,18,25], with no description of the processes followed to calibrate or validate the scores. This ad hoc approach makes it difficult to perform reliable comparisons between results from various studies that measure the same concept.
Likewise, the evaluation of learning in digital heritage education environments has not been constructed on the basis of an organized sequence that identifies the main dimensions or latent variables. All of this makes it necessary to deploy a standardized scale, capable of accurately measuring the constructs of a sequence of heritage processes in different contexts, environments and actions. Furthermore, a scale is required that allows results to be compared across different groups and populations, using standard scores to evaluate the effectiveness of different heritage education programs or, where appropriate, measure changes in heritage learning outcomes.
Following from the design and calibration method of the Q-Edutage scale focused on the evaluation of heritage education programs [27], we propose to lay out and calibrate a scale articulated around seven factors underpinned by the seven verbs in the Heritage Learning Sequence (HLS), which define the main learning actions concerning heritage (i.e., knowing, understanding, respecting, valuing, caring, enjoying and transmitting; Fontal et al., 2022 [28]) and make up the seven dimensions of the Q-Herilearn scale that we present here. These terms comprise the educational action that results in heritage learning outcomes in digital environments and are identified following the theoretical model that supports the HLS, in turn inspired by the content analysis of the main international texts, treaties and recommendations (UN, UNESCO, EU) in matters of heritage [29] as well as by the analysis of the main verbs used in the conceptualization of heritage by users of digital environments [28]:
1. Knowing: Acquiring an understanding of the range of cultural assets that are part of the historical and cultural heritage of a society or community.
2. Understanding: Comprehending the meaning of heritage, its historical, cultural and social context, as well as the relationships and connections between different heritage items.
3. Respecting: Adopting an attitude of care, appreciation, commitment and responsibility towards heritage.
4. Valuing: Appreciating the importance and significance of heritage, recognizing its valuable qualities for a community.
5. Caring: Taking action to protect, conserve and preserve heritage for present and future generations.
6. Enjoying: Actively experiencing and appreciating heritage for pleasure and personal enrichment.
7. Transmitting: Effectively sharing and communicating the knowledge, values, traditions, stories and significance of heritage to present and future generations.

Study goals
As a result of the above considerations, the present study aims to (a) develop an instrument with sound metric qualities that assesses how we learn heritage in digital environments and (b) calibrate the instrument itself by using a mixed approach based on measurement models (Exploratory Structural Equation Modeling) and Item Response Theory.

Research design and hypotheses
This work follows the methodology of cross-sectional survey designs, the essential purpose of which is to provide a quantitative description of participants' opinions as expressed through responses to structured questionnaires [30,31]. The exploratory study starts from the HLS, which identifies the seven main verbs in heritage learning set out above. These verbs constitute the seven dimensions of the Heritage Process Model (HPM; Fontal et al., 2022 [28]). Each of the latent variables is assessed by means of 7 indicators. Both unidimensional models and an ESEM model consisting of the 49 items and the seven factors or dimensions have been analysed (see Fig 1). The hypotheses are derived directly from these models, and are as follows: (a) each of the dimensions (knowing, understanding, respecting, valuing, caring, enjoying, transmitting) is measured by 7 indicators, as depicted in Fig 1A, and (b) the indicator loadings will be significant and higher on each reference factor than on the rest of the factors, as shown in Fig 1B.

Participants
The final sample consisted of N = 1,454 participants aged 19 to 63 years (M = 26.71, SD = 10.51). For some of the analyses below, the variable age was categorized into six groups.
The defining characteristics of the participants (age, gender, country of residence, number of countries visited, area of residence, mother tongue, level of education) are summarized in Table 1.
All participants completed an online survey (https://oepe.es/escala-herilearn/) between May 9, 2022 and September 2, 2023, after being informed of the purposes of the study and guaranteed complete data confidentiality, in accordance with the provisions of the CEISH UPV-EHU Ethics Committee (Cod: M10_2021_31). They were also informed that the survey consisted of 97 items. Participants could interrupt, postpone or abandon the survey at any time (in the latter case, the data were automatically deleted). A total of 1,389 responses were obtained with complete socio-demographic information, plus 65 in which only some of the fields were filled out. Acceptance of informed consent was a prerequisite for responding to the survey.

Sample size, power and precision
In order to determine the minimum sample size, we took into account (a) statistical power (at least 80%); (b) effect size ($f^2 \geq .35$); and (c) significance level (α = .05). To calibrate the precision and power achieved by the analysis given the sample size used (N = 1,328), we performed a Monte Carlo analysis (10,000 replicates) using as population parameters the results of the structural model (see Supporting Information, S9 Table), as recommended by Muthén & Muthén (2002) [32].
The analysis was performed with Mplus, v. 8.10 [33], and convergence was achieved without problems in 100% of the requested replicates. S9 Table (Supporting Information) shows the results on the parameters of the structural model. The population parameters and the means of the parameters estimated by the model were very similar in all cases, suggesting the absence of bias in the estimation. Similar results were observed in the estimation of the standard error, with no evidence of relevant bias in any of the parameters analyzed. The Mean Squared Error (MSE) values were in all cases very close to zero, confirming the absence of bias observed in the comparison between population and simulated parameters. In between 94% and 96% of the replicates, the 95% confidence interval contained the population value. For population parameters greater than zero, the test reached the maximum power (1.000) in all cases. For population parameters with a value of zero, the proportion of replicates in which the parameter was significant always remained close to the desired value of .05. In conclusion, the results of the Monte Carlo analysis suggest that with this sample size very precise estimates of the model parameters were achieved, with high power and a low probability of Type I error.
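The quantities summarized in this kind of Monte Carlo study (bias, MSE, coverage of the 95% confidence interval, and the proportion of significant replicates) can be illustrated with a deliberately simple Python sketch. It does not reproduce the Mplus analysis: the model is reduced to a single mean estimated from normally distributed data, and the function name, population values and number of replicates are illustrative assumptions.

```python
import numpy as np

def monte_carlo_summary(pop_value, n, reps, seed=0):
    """Simulate `reps` samples of size `n` from N(pop_value, 1) and summarise
    how well the sample mean recovers the population value (toy model)."""
    rng = np.random.default_rng(seed)
    samples = rng.normal(loc=pop_value, scale=1.0, size=(reps, n))
    est = samples.mean(axis=1)                       # estimate in each replicate
    se = samples.std(axis=1, ddof=1) / np.sqrt(n)    # estimated standard error
    lower, upper = est - 1.96 * se, est + 1.96 * se  # normal-theory 95% CI
    return {
        "bias": est.mean() - pop_value,
        "mse": np.mean((est - pop_value) ** 2),
        "coverage": np.mean((lower <= pop_value) & (pop_value <= upper)),
        # Share of replicates whose CI excludes zero: this is the power when
        # pop_value != 0 and the Type I error rate when pop_value == 0.
        "reject_rate": np.mean((lower > 0) | (upper < 0)),
    }

# Hypothetical population values; n matches the analysed sample size.
nonzero = monte_carlo_summary(pop_value=0.5, n=1328, reps=2000)
null = monte_carlo_summary(pop_value=0.0, n=1328, reps=2000)
print(nonzero)
print(null)
```

With a non-zero population value the rejection rate plays the role of power, whereas with a population value of zero it estimates the Type I error rate, which should stay close to the nominal .05.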

Data collection and cleaning
Data were retrieved from the LimeSurvey platform, transferred to R and cleaned using the following three strategies: outlier filtering, multivariate outlier detection and missing data processing.

Selection of anomalous responses
Anomalous response patterns (e.g., repetitive, invariant, random or sloppy responses) can profoundly alter the results of data analysis, even if they occur in very small proportions [34,35].
To avoid this bias, the data were cleaned in two ways. First, we eliminated cases where the same answer was given to all 49 items (straight-lining; N = 9, 0.62%), considering that this pattern is highly improbable given the number of test items. Secondly, we estimated the polytomous version of the standardized log-likelihood statistic $l_z^p$ [36,37] for each response vector. Extreme values in the left tail of $l_z^p$ ($\leq -3$) indicate response patterns that are highly unexpected given the measurement model: these patterns are usually the result of random responses not based on item content. The cut-off point set at -1.6308 identified 71 cases (4.88%) with anomalous responses. The more conservative cut-off point of -3.00 identified 26 anomalous responses (1.79%), as seen in Fig 2, which shows the histogram and density of the Person Fit Scores (PFS). Consequently, the 26 cases with $l_z^p \leq -3$ were excluded from further analysis.
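Both screening steps can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names are hypothetical, and the person-fit function assumes that the model-implied category probabilities of each item at the respondent's estimated trait level are already available (in practice they come from the fitted IRT model). The statistic is computed as the observed log-likelihood of the response pattern minus its expectation, divided by its standard deviation.

```python
import numpy as np

def is_straight_lining(responses):
    """True for each respondent (row) who gave the same answer to every item."""
    responses = np.asarray(responses)
    return (responses == responses[:, [0]]).all(axis=1)

def lz_polytomous(pattern, probs):
    """Standardized log-likelihood person-fit statistic for polytomous items.

    pattern: observed category index (0-based) per item, shape (n_items,)
    probs:   model-implied category probabilities at the respondent's trait
             estimate, shape (n_items, n_categories), each row summing to 1
    """
    pattern = np.asarray(pattern)
    probs = np.asarray(probs, dtype=float)
    log_p = np.log(probs)
    observed = log_p[np.arange(len(pattern)), pattern].sum()  # log-likelihood
    expected = (probs * log_p).sum()                          # its expectation
    variance = ((probs * log_p**2).sum(axis=1)
                - (probs * log_p).sum(axis=1) ** 2).sum()     # its variance
    return (observed - expected) / np.sqrt(variance)

# Hypothetical data: two respondents, and four items sharing the same
# (assumed) category probabilities.
flags = is_straight_lining([[3, 3, 3, 3], [1, 2, 3, 2]])
probs = np.tile([0.7, 0.2, 0.1], (4, 1))
lz_expected_pattern = lz_polytomous([0, 0, 0, 0], probs)    # likely answers
lz_unexpected_pattern = lz_polytomous([2, 2, 2, 2], probs)  # unlikely answers
print(flags, lz_expected_pattern, lz_unexpected_pattern)
```

A respondent who always picks the least probable category receives a strongly negative value and would fall below the -3 cut-off, whereas a pattern made of the most probable categories yields a positive value.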

Detection of multivariate outliers
In summary, the combination of the procedures described above resulted in 8.67% of participants (N = 126) meeting one of the selection criteria and therefore being removed from the database for further analysis. S10 Table (Supporting Information) provides a summary of the eliminated cases.

Treatment of missing data
Given the sufficient sample size, the low proportion of cases with missing data (< 3%), the high average data coverage (> 98%) and the MCAR structure (Little's test: $\chi^2(5056) = 5212.605$, p = .061), multiple imputation was considered unnecessary and Full Information Maximum Likelihood (FIML) was used to estimate the parameters of the factor models, using all available data [38].

Analysis procedures
Two types of analysis have been employed: (a) factor analysis and (b) analysis using Item Response Theory models.
Factor analysis. Factor analysis was conducted in three phases. The aim of the first phase was to estimate the fit of each subscale to the one-dimensional confirmatory factor model, so as to estimate the convergent validity of the items, and to verify that each subscale achieved sufficient reliability and internal consistency. For this purpose, seven unidimensional confirmatory models were estimated, one for each factor (see Fig 1A), as well as the average variance extracted (AVE), Cronbach's ordinal alpha, McDonald's omega, composite reliability (CR) and the Great Lower Bound of Reliability (GLB).
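As a minimal sketch of two of these estimators, AVE and CR can be computed directly from the standardized loadings of a unidimensional model, assuming uncorrelated residuals: AVE is the mean squared loading, and CR is the squared sum of loadings divided by that quantity plus the sum of residual variances. The function names and the loadings below are hypothetical, not values from the study.

```python
import numpy as np

def average_variance_extracted(loadings):
    """AVE: mean of the squared standardized loadings of a factor."""
    lam = np.asarray(loadings, dtype=float)
    return np.mean(lam ** 2)

def composite_reliability(loadings):
    """CR for standardized loadings with uncorrelated residuals."""
    lam = np.asarray(loadings, dtype=float)
    true_var = lam.sum() ** 2          # variance attributable to the factor
    error_var = (1.0 - lam ** 2).sum() # sum of residual variances
    return true_var / (true_var + error_var)

# Hypothetical seven-item subscale with equal loadings of .70.
loadings = [0.70] * 7
ave = average_variance_extracted(loadings)
cr = composite_reliability(loadings)
print(round(ave, 3), round(cr, 3))
```

The example shows why the two indices can diverge: seven items loading .70 give an AVE just under .50 while the composite reliability is already close to .87, because CR also grows with the number of items.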
The aim of the second phase of the analysis was to investigate whether (a) it is possible to recover the theoretical structure of the measure from the pooled data, and (b) the items have sufficient discriminative ability, i.e., they measure their theoretical factor substantially better than the rest of the factors. For this purpose, an exploratory structural equation model (ESEM; [39,40]) was estimated using all items of the scale simultaneously (Fig 1B). Oblique target rotation was used. Target rotation allows items to load freely on their reference factor, and seeks the rotated solution where the cross-loadings are as close as possible to the expected size according to the theoretical starting model (in this case, zero). Thus, by allowing the expression of a priori hypotheses about the pattern of primary loadings and cross-loadings, the target rotation allows the ESEM to be used in a semi-confirmatory way [39].
The aim of the third phase was to verify compliance with measurement invariance by gender and age. In the case of gender, two nested ESEM models were estimated [41]: configural (equivalence of number and layout of factors), and scalar (equivalence of primary loadings and cross-loadings, and of thresholds). In the case of age, as it is a continuous variable, we chose an approach based on multiple indicator multiple cause models (MIMIC; [42]), following the recommendations of Morin et al. (2016) [43], to assess invariance by comparing the fit of nested MIMIC models. With age as the predictor variable, two models were compared: (a) an invariant model, where regression coefficients between age and each of the factors were estimated, restricting any direct relationship between age and item responses to zero, and (b) a saturated model, which assumes no scalar invariance, restricting any relationship between age and the factors to zero, and estimating regression coefficients between age and each of the items. If the fit of the invariant (more parsimonious) model is similar to the fit of the saturated model, one can with reasonable confidence rule out the presence of serious violations of scalar invariance.
All factor models were estimated using Weighted Least Squares Mean and Variance Adjusted (WLSMV), given the ordinal nature of the item responses [44]. Goodness of fit was assessed using the Comparative Fit Index (CFI), the Tucker-Lewis Index (TLI), and the Root Mean Square Error of Approximation (RMSEA). Conventionally, CFI and TLI values above .90 and .95 indicate acceptable and good fit, respectively [45,46]. In the case of RMSEA, values at or below .05 and .08 are respectively considered good and acceptable [47].
In order to make decisions on the significance of differences in fit between nested models, we followed the recommendations of Chen (2007) [48] and Cheung & Rensvold (2002) [49], according to which decreases of less than .01 in CFI and TLI, and increases of less than .015 in RMSEA, suggest that there is no relevant change in the fit of one model with respect to the next most restrictive one. In addition, maximum likelihood with robust standard errors (MLR) was applied on the data treated as categorical variables to estimate the Bayesian Information Criterion (BIC) and the Akaike Information Criterion (AIC): when comparing two nested models, lower values of BIC and AIC suggest a better fit.
Analysis using item response theory models. Once the structure of the data had been analysed, we conducted a detailed analysis of the items' properties by estimating IRT models. As a preliminary step, we investigated the dimensionality of each theoretical factor in order to ensure that the data were suitable for analysis using unidimensional IRT models. In order to secure sufficient compliance with unidimensionality and conditional independence, each scale had to meet the following requirements: (a) the percentage of variance explained by the second factor should not exceed that explained by random data simulated by optimized parallel analysis [50]; (b) the Explained Common Variance (ECV) of the first factor should be greater than .80; (c) the Mean of Items Residuals Absolute Loadings (MIREAL; [51]) should be less than .30; (d) the Factor Determinacy Index (FDI; [51]) should be greater than .90; and (e) the Generalized H Index (G-H; [52]) should be greater than .80.
ECV measures the dominance of the first factor over the rest of the factors. Values above .80 allow us to conclude that the solution is essentially unidimensional [53]. MIREAL is the mean of the absolute loadings on the second factor obtained by Minimum Rank Factor Analysis (MRFA), and assesses the extent to which the structure of the data deviates from unidimensionality. As a practical rule, values below .30 indicate the absence of a relevant residual factor. FDI is the correlation between factor score estimates and the levels of the latent factors they estimate [54]. Values above .80 are acceptable. Finally, G-H measures the degree to which a factor is correctly represented by a set of items, i.e., the maximum proportion of factor variance that can be explained by its indicators (construct reliability), with values above .70 being acceptable.
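The decision rules described in the two preceding paragraphs can be expressed as two small Python helpers. This is only a sketch: the function names and the fit values in the example are illustrative assumptions, the cut-offs are those cited in the text, and a change is treated as negligible when CFI and TLI worsen by less than .01 and RMSEA worsens by less than .015 from the less restrictive to the more restrictive model.

```python
def classify_fit(cfi, tli, rmsea):
    """Label global fit using the conventional cut-offs cited in the text."""
    if cfi > 0.95 and tli > 0.95 and rmsea <= 0.05:
        return "good"
    if cfi > 0.90 and tli > 0.90 and rmsea <= 0.08:
        return "acceptable"
    return "poor"

def negligible_change(less_restricted, more_restricted):
    """True when adding constraints does not worsen fit in a relevant way.

    Both arguments are dicts with the keys 'cfi', 'tli' and 'rmsea'.
    """
    d_cfi = more_restricted["cfi"] - less_restricted["cfi"]
    d_tli = more_restricted["tli"] - less_restricted["tli"]
    d_rmsea = more_restricted["rmsea"] - less_restricted["rmsea"]
    return d_cfi > -0.01 and d_tli > -0.01 and d_rmsea < 0.015

# Hypothetical fit indices for two nested models.
configural = {"cfi": 0.975, "tli": 0.970, "rmsea": 0.040}
scalar = {"cfi": 0.970, "tli": 0.968, "rmsea": 0.045}
print(classify_fit(**configural), negligible_change(configural, scalar))
```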
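To make two of these indices concrete, the sketch below computes ECV and MIREAL from the loadings of a solution with one dominant factor and one residual factor. It is a simplified illustration under stated assumptions, not the FACTOR implementation: ECV is taken as the share of the summed squared loadings that belongs to the first factor, MIREAL as the mean absolute loading on the second factor, and all loadings are hypothetical.

```python
import numpy as np

def explained_common_variance(first, second):
    """Share of common variance captured by the first (dominant) factor."""
    f = np.asarray(first, dtype=float)
    s = np.asarray(second, dtype=float)
    return (f ** 2).sum() / ((f ** 2).sum() + (s ** 2).sum())

def mireal(second):
    """Mean absolute loading on the second (residual) factor."""
    return np.abs(np.asarray(second, dtype=float)).mean()

# Hypothetical loadings for a seven-item subscale.
first_factor = [0.80, 0.75, 0.70, 0.80, 0.70, 0.75, 0.80]
second_factor = [0.20, -0.10, 0.15, -0.20, 0.10, 0.05, -0.15]

ecv_value = explained_common_variance(first_factor, second_factor)
mireal_value = mireal(second_factor)
print(round(ecv_value, 3), round(mireal_value, 3))
```

In this hypothetical case ECV is well above .80 and MIREAL well below .30, so the subscale would be treated as essentially unidimensional.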
To calculate the indices described above, we estimated an exploratory bifactor model for each facet using Minimum Rank Factor Analysis.
After ensuring that all factors reached a sufficient degree of unidimensionality and conditional independence, we estimated a Graded Response Model (GRM; [55]) for each dimension.We then inspected the discrimination and difficulty parameters for each item, as well as the information functions of the test.
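For readers less familiar with the GRM, the following Python sketch shows how category probabilities follow from the item parameters. It assumes the usual discrimination/difficulty parameterization, in which the probability of responding in category k or higher is a logistic function of a(θ - b_k); software such as Mplus may report the parameters in a different but equivalent form. The function name and the parameter values are illustrative.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under the graded response model.

    theta: latent trait value(s)
    a:     item discrimination
    b:     increasing threshold (difficulty) parameters, length K - 1
    Returns an array of shape (len(theta), K).
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # P(X >= k | theta) for k = 1, ..., K - 1
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    # Pad with P(X >= 0) = 1 and P(X >= K) = 0, then difference adjacent terms
    ones = np.ones((len(theta), 1))
    zeros = np.zeros((len(theta), 1))
    cumulative = np.hstack([ones, p_star, zeros])
    return cumulative[:, :-1] - cumulative[:, 1:]

# Hypothetical four-category item evaluated at theta = 0.
probs = grm_category_probs(0.0, a=2.0, b=[-1.0, 0.0, 1.0])
print(probs.round(3))  # approximately [[0.119 0.381 0.381 0.119]]
```

Higher discrimination values make the cumulative curves steeper, so the item separates respondents located near its thresholds more sharply.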
Dimensionality analyses were performed with the FACTOR, v. 12.04.04 software [56]. IRT analyses were performed using Mplus, v. 8.10 [33].

Instrumentation
In order to examine the literature on the topic published in recent years, a WoS search was carried out (March 2022). 212 references were found using the following search terms: "heritage AND (evaluat* OR assessment OR scal*) in Title. Document Types: Article. Database: Web of Science Core Collection. Publication Years: 2010 to 2023. Research Areas: Arts Humanities Other Topics or Social Sciences Other Topics or Psychology." As mentioned in the introduction, none of the works retrieved were dedicated to developing specific instruments for assessing heritage learning in digital contexts. Therefore, in view of the lack of instruments, and putting the focus on the concepts included in the heritage sequence [57], a pool of items was drawn up to measure each of the seven dimensions of the sequence.
In the design and general implementation of the instrument, we followed the common postulates and recommendations for the development of scales and assessment instruments. In the wording of the items, we followed the usual rules in the construction of items of probabilistic scales for summative estimates [58][59][60]: (h) items should be written in clear, simple, straightforward language; (i) sentences should be short (i.e., they should not exceed 20 words); (j) each sentence should contain only one complete idea; (k) statements containing extreme expressions such as "all", "always", "none" or "never" should be avoided; (l) items should not contain adverbs such as "only", "solely", "merely" or similar ones; (m) statements should be formulated in simple rather than compound or complex sentences; (n) vocabulary should be accessible to potential respondents; (o) item valence should be positive; and (p) items should not contain negative or double negative expressions.

Evidence of validity
Content-based validity evidence. In order to ensure the relationship between the content of the instrument and the construct it was intended to measure [58], both logical (clarification of the content through focus groups) and empirical (submission of the items to expert judgment, as detailed below) analyses were carried out.
Following the recommendations mentioned above, an initial pool of 117 items was drawn up, the content of which was submitted to 40 independent expert judges/raters, who had to evaluate on a scale of 1 to 4 points (a) the clarity of the item formulation; (b) the relevance or importance of each item for measuring the dimensions of the sequence; and (c) the suitability or appropriateness for measuring these dimensions. In addition, the judges had to indicate to which of the seven theoretical dimensions each item could be ascribed on the basis of its content. The judges issued their ratings online, through the LimeSurvey platform, during the second half of May 2022 (the matrix of judges' ratings can be found in the Supporting Information).
During the first screening analysis, items with a mean lower than 3 and a standard deviation higher than 1 according to the rating given by the judges were discarded. This first screening resulted in a set of 97 items that met these requirements (see Supporting Information, S1-S7 Tables).
The results of the analysis of the agreement matrices using Bangdiwala's $B_N$ coefficients for nominal data and $B_N^w$ coefficients for ordinal data (Bangdiwala & Shankar, 2013) are shown in Fig 4. The overall coefficients of agreement can be considered very satisfactory. Thus, the degree of agreement was almost perfect in relevance ($B_N^w$ = .811) and appropriateness ($B_N^w$ = .808), and substantial in clarity ($B_N^w$ = .772) and dimension ($B_N$ = .539), in accordance with the interpretation guidelines proposed by Muñoz and Bangdiwala (1997, p. 111) [66].
The agreement of the 3,880 decisions made by the judges in their ascription of the items to each of the seven dimensions resulted in a Fleiss Kappa value of κ = .671, an observed agreement of OA = .722, and a Krippendorff alpha coefficient of α = .671 (see complete results in Supporting Information, S11-S17 Tables). Taking into account the magnitude of the aforementioned coefficients, the overall agreement among the judges can be considered substantial [66,67].
Evidence based on internal structure. Factor structure. Table 2 shows the results of the factor analysis (the polychoric correlations among items are reported in S8 Table of the Supporting Information). Each unidimensional baseline model was estimated with 14 degrees of freedom; the two additional free parameters of each model reported in Table 2 correspond to the estimation of two correlations between the residuals of pairs of items that, showing clear semantic similarity, obtained MI (Modification Index) and SEPC (Standardized Expected Parameter Change) values substantially greater than 10 and 0.3, respectively. The fit of the unidimensional models was reasonably high, with RMSEA values between .086 (RES scale) and .037 (ENJ scale), CFI values between .983 (RES scale) and .998 (ENJ scale), and SRMR between .029 (RES scale) and .011 (ENJ scale).
Convergent and discriminant evidence. Table 3 shows the reliability and internal consistency estimators from raw scores (Cronbach's alpha) and unidimensional models (McDonald's omega and GLB), as well as the composite reliability (CR) and the item convergent validity estimator (AVE). All alpha, omega, GLB and CR values were above .80, with the minimum being observed for the RES scale (α = .83, ω = .83, GLB = .87, CR = .87) and the maximum for the ENJ scale (α = .90, ω = .89, GLB = .91, CR = .92). The AVE values were satisfactory in all cases except for the RES factor, with an AVE value of .48, very close to the minimum value necessary (.50) to guarantee the convergent validity of the factor. It should be noted, in any case, that the value .50 is within the limits of the confidence interval used.
The ESEM model showed a reasonably high fit (RMSEA = .036; CFI = .977; SRMR = .020). However, this result was to be expected given the high parameterization of the model. Table 4 shows the standardized factor loadings, and the Item Explained Common Variance (iECV). The iECV quantifies the variance captured by the item in its reference factor, versus the amount of common variance captured by all possible cross-loadings. Accordingly, here we use the iECV as an estimator of the item's ability to discriminate between its theoretical membership factor and all other factors, with a minimum desirable value of .50 (an iECV ≥ .50 indicates that the primary factor explains as much or more common variance in item responses than all other factors combined).
Regarding the value of the primary loadings and cross-loadings, it is observed in the first place that the model has satisfactorily recovered the theoretical structure, given that in all items the most salient loading is always the one corresponding to the primary factor (see Fig 5). Secondly, the iECV values were in a range between .374 (res26) and .977 (res30), with 45 of the 49 items showing a value above .50. In conclusion, it was possible to reproduce from the data a structure highly consistent with that expected by the theoretical model, without the need to eliminate items or introduce modifications into the model specification.
The correlations between the factors (S18 Table, Supporting Information) were adequate in all cases, ranging from -.075 (RES-CAR) to .602 (VAL-UND).
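As an illustration of one of these agreement coefficients, the sketch below computes Fleiss' kappa from a table that counts, for each item, how many judges assigned it to each category. The function name and the small example table are hypothetical; the judges' actual rating matrix is the one provided in the Supporting Information. The function also returns the mean pairwise observed agreement that enters the kappa formula.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for a table of shape (n_items, n_categories).

    counts[i, j] is the number of raters who assigned item i to category j;
    every row must sum to the same number of raters.
    Returns (kappa, mean observed agreement).
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()
    # Proportion of all assignments falling in each category
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    # Agreement among rater pairs for each item
    p_item = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_obs = p_item.mean()
    p_exp = (p_cat ** 2).sum()   # agreement expected by chance
    return (p_obs - p_exp) / (1.0 - p_exp), p_obs

# Hypothetical example: 3 items, 4 raters, 3 candidate dimensions.
table = [[4, 0, 0],
         [0, 4, 0],
         [2, 2, 0]]
kappa, observed = fleiss_kappa(table)
print(round(kappa, 3), round(observed, 3))
```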
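The iECV described above can be obtained from a matrix of standardized loadings as the squared primary loading of each item divided by the sum of its squared loadings on all factors. The following Python sketch uses a hypothetical function name and hypothetical loadings for two items on three factors; it illustrates the computation only and does not reproduce Table 4.

```python
import numpy as np

def item_ecv(loadings, primary):
    """Item Explained Common Variance.

    loadings: standardized loadings, shape (n_items, n_factors)
    primary:  index of the reference (theoretical) factor of each item
    """
    squared = np.asarray(loadings, dtype=float) ** 2
    primary = np.asarray(primary)
    rows = np.arange(squared.shape[0])
    return squared[rows, primary] / squared.sum(axis=1)

# Hypothetical items: the first one loads cleanly on its reference factor,
# the second one has a cross-loading larger than its primary loading.
example_loadings = [[0.70, 0.10, 0.10],
                    [0.40, 0.50, 0.10]]
iecv = item_ecv(example_loadings, primary=[0, 0])
print(iecv.round(3))
```

Only the first hypothetical item exceeds the .50 threshold, so the second one would be flagged as discriminating poorly between its reference factor and the remaining factors.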
Invariance analysis. Tables 5 and 6 show the results of the invariance analysis by gender and age.
Invariance by gender. Regarding gender, differences in favor of the scalar model were observed in all the indices (ΔRMSEA = -.009; ΔCFI = .007; ΔTLI = .013; ΔAIC = -216; ΔBIC = -1973), except in SRMR, with a slight difference in favor of the configural model (ΔSRMR = -.004). This result suggests the absence of substantial differences in the model parameters according to the gender of participants. The category "non-binary" has not been included in this analysis due to the low number of participants (N = 13) who indicated this option.
Invariance by age. With respect to age, the saturated model obtained a slightly better fit (ΔRMSEA = .001; ΔCFI = -.001; ΔAIC = 75; ΔBIC = 1345; ΔSRMR = .004). We further investigated the local fit of the invariant model in order to detect regression parameters between age and each item that, when set to zero, would reveal a relevant misspecification. However, we found no clear evidence that the misfit of the invariant model was caused by a particular subset of items, but rather by the accumulation of low magnitude misfits spread across all restricted parameters. Given these results, and the small size of the differences in fit between the invariant and the saturated model, we chose to attribute the differences in fit to a greater parameterization of the saturated model, and not to the presence of relevant invariance problems.
IRT analysis. Table 7 shows the parameters obtained after estimation of the seven GRM models. The α discrimination parameters ranged from 1.236 (res26) to 3.430 (tra86). According to the classification proposed by Baker and Kim (2017) [68], one item obtained a discrimination parameter of moderate size (1.236), six items of high size (between 1.457 and 1.675), and 42 items of very high size (between 1.691 and 3.430). The β parameters were generally adequate, covering in all items a sufficiently wide theta range. However, item res30 ("I have a respectful attitude towards the diversity of personal heritages") showed an extremely low $\beta_1$ value ($\beta_1$ = -5.752), indicating that this item is extremely "easy" given the characteristics of the sample. Other items showed results opposite to the one described, with very high $\beta_1$ values. This effect was mostly concentrated in the CAR scale. For example, item car60 ("I collaborate in action networks for the protection of heritage and to prevent the dangers of not taking care of it") showed values $\beta_1$ = 1.026, $\beta_2$ = 4.193, and $\beta_3$ = 6.603. This implies that it is very unlikely to observe an affirmative response ("sometimes" or higher), except in people who show a substantially high level of commitment to active heritage care.
Next, we examined the behavior of each scale by inspecting the Test Information Curves (TICs) depicted in panels (a) through (g) of Fig 6. The KNO, UND, VAL, and ENJ scales were maximally informative over a wide range of the latent variable, ranging from approximately -1.5 to 1.5 standard deviations around the mean. This result suggests that the scales measured their respective constructs quite reliably in people with low, medium, and high levels of the latent variable. The TRA scale showed a slightly right-shifted TIC, with maximum information in a range between approximately -0.5 and 1.5 theta values. The TICs of the CAR and RES scales showed information profiles that were substantially different from the rest of the scales. The TIC of the CAR scale showed a strong shift to the right of the latent continuum, with maximum information between approximately 0.2 and 2.2 standard deviations above the mean of the latent variable. This implies that the scale discriminates well between people who manifest a medium-high to very high level of CAR, but may have difficulty in accurately detecting individual differences in the low range of the variable. The RES scale, on the contrary, presents a TIC that is strongly shifted to the left of the latent continuum, with maximum information between approximately -2.5 and 0.7 standard deviations around the mean, discriminating accurately between people with medium to low/very low levels on the variable, but with discrimination problems at high and very high levels.
Taking into account the content and purpose of the RES and CAR scales, and the characteristics of the sample, we can conclude that the results described are not unexpected, and do not pose a problem in terms of the validity and usefulness of the measure, for the reasons given below.
The RES scale consists of statements about respect both for heritage as a whole (e.g., "I respect all heritage assets, even if I do not feel identified with some") and for diversity of tastes and opinions (e.g., "I urge others to be respectful of any type of cultural heritage"). Respect for the common good and tolerance of dissent are widespread principles in Western European culture. Thus, it is to be expected that in a questionnaire focused on these values we would obtain a majority of favorable responses and, therefore, maximum discrimination in low areas of the variable (i.e., among people who express neutral or negative attitudes). This expectation is consistent with the results of the analysis, which enables us to conclude that the RES scale:
1. Discriminates well between people who hold attitudes that we might consider normative in Western society (i.e., valuing the common good positively, respecting diversity), and people who deviate from the norm (i.e., valuing neutrally or negatively); and
2. Discriminates well among individual differences within the second group.
The CAR scale, on the other hand, focuses on the evaluation of overt behaviors related to heritage care. It is expected that participants will find the CAR items difficult, and that the discriminative power of the scale will be optimal at medium to high levels of the latent variable, given that:
Unlike the other scales, CAR is organized as a "unipolar dimension" [69], where the negative pole does not represent neglect or mistreatment of heritage, but rather the absence of caring behaviors. Thus, it is logical that the CAR scale should accurately discriminate between people who actively engage in the defense of heritage and those who do not (or do so very infrequently), and should more accurately grade the intensity of active involvement among people in the first group.
The CAR scale, understood as a sample of heritage care behaviors, is limited to actions that take place in social and online media, leaving out of the measurement individual or collective actions that occur exclusively in face-to-face interactions or by other means. The expected consequence of this sampling restriction is a lower observed frequency of caring behaviors, which translates into higher difficulty parameters.
Taken together, the results of the CAR scale and the other scales (especially RES) show that in this sample the probability of taking actions in favor of heritage is much lower than that of expressing beliefs or "feelings" in favor of heritage. This apparent incongruence was to be expected, given the complex relationship between beliefs and overt behaviors, even though the latter should be a logical consequence of the former (see, e.g., Ajzen & Fishbein, 1977 [70]).

Discussion and conclusions
Q-Herilearn has demonstrated metric guarantees of sufficient validity and reliability as an instrument to accurately measure the processes involved in heritage learning. Given that there are significant differences in heritage learning outcomes depending on the particular digital medium or mediator in which they have occurred [71], a scale is needed that can be equally used in all digital environments by focusing on structuring dimensions in heritage learning. In this sense, Q-Herilearn would allow comparing the learning outcomes around the same heritage content in different contexts or with different educational mediation strategies.

Implications
The applicability of the scale encompasses the set of processes and procedures involved in heritage education, i.e., teaching, learning, implementation processes, media/mediators and contexts.
In terms of heritage education and, in particular, the design of educational programs, the 7-dimension structure (which covers the complete sequence of heritage processes) makes it possible to identify the objectives of any heritage education program; each dimension is supported by a verb, and the verbs make up the teaching objectives and, therefore, the heritage learning outcomes. In addition, the items of the scale for each dimension make it possible to define the learning objectives operationally, so that they can be used individually or in order to relate items from the different dimensions.
In turn, Q-Herilearn will serve as a measurement instrument in the implementation processes of heritage education programs in digital environments, permitting the evaluation of the degree and scope of heritage learning outcomes along the seven dimensions of the HLS, both globally and for each of them individually.
Heritage can be considered as a key element in promoting social cohesion through experiences in virtual environments, in that it equalizes or improves access to opportunities for many people in different geographical areas. In this sense, Q-Herilearn has been calibrated and standardized to be applicable to different contexts, including its translation and adaptation into five other languages (English, French, Basque, Italian and Portuguese).

Limitations
This study has several limitations. The most important ones refer to the use of a non-probabilistic (incidental) sample. Although the Monte Carlo analysis has shown that the N value used guarantees sufficient precision and statistical power, it should be noted that the non-probabilistic nature of the sample may affect the external validity of the results. In this regard, the three main weaknesses of the study should be noted, which have to do with (a) a limited potential for generalizability, as the sample may not accurately represent the characteristics, diversity or demographics of the population; (b) the selection bias, as the very nature of the data collection instrument (an Internet survey) could result in a portion of the population being overrepresented in the sample; and (c) the lack of variability, as the limited diversity within the sample could restrict the range of responses and reduce the applicability of the results to a broader population. These shortcomings suggest that future research should use a probabilistic sampling methodology based on random selection procedures that provide a higher likelihood of obtaining representative samples from the different populations on which the instrument is applied.

Future avenues for research
An explanatory model (HPM) has been used to articulate the learning processes in Heritage Education (HLS) that (a) is based on international references, (b) covers a complete cycle in heritage learning and (c) is generalizable and adaptable to different educational designs. The accuracy and consistency of the measure have been demonstrated both in the general scale and in each of the subscales. From here on, the immediate lines of research are geared toward:
Investigating the usefulness of the scale in applied contexts. For example, gauging the extent to which the scale factors are sensitive to the changes that heritage education programs are expected to produce.
Identifying the procedures that users most frequently follow to learn about heritage in digital environments; i.e., in what ways heritage is learned and what specific learning profiles exist, using mixed models (factorial-latent classes).
Applying the full scale in digital heritage learning environments and on different populations to check whether or not there are differences according to socio-demographic traits (e.g., general users, university students, minority groups, people who share different degrees of engagement with heritage, cultural backgrounds, etc.).
Using partial scales, individually or jointly, to measure heritage learning outcomes derived from the implementation of educational designs (these scales would be selected according to the verbs that articulate the objectives of these designs).
Comparing the responses obtained according to the language in which they were answered (i.e., Spanish, English, French, Italian, Portuguese and Basque), or to the bilingual nature of societies with minority languages.

Fig 1. Schematic representation of the one-dimensional and ESEM models. A Unidimensional models. B ESEM model. https://doi.org/10.1371/journal.pone.0299733.g001
shown in Fig 3, we plot the robust Mahalanobis squared ordered distances of the observations against the empirical distribution function of MD²ᵢ. Fig 3A shows the maximum value curve of the MD²ᵢ distribution, while Fig 3B shows the maximum values detected by the specified quantile (97.5%). Multivariate outliers, i.e., observations outside the 97.5% quantile of the χ² distribution (N = 26, 1.79%), marked in red in Fig 3B (the numbers correspond to the observations in the original database), were removed.
Fig 3. The first subfigure shows the peak value curve of the MD²ᵢ distribution, and the second subfigure shows the peak values detected by the specified fitted quantile (97.5%).
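The logic of the outlier screening described above (squared Mahalanobis distances compared against the 97.5% quantile of a χ² distribution) can be sketched as follows. This is a deliberately simplified illustration and not the procedure used in the study: it uses classical (non-robust) mean and covariance estimates instead of robust ones, works on two simulated variables only, and relies on the closed-form χ² quantile that holds for two degrees of freedom. All data and function names are hypothetical.

```python
import math
import random

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Classical sample covariance matrix (ASSUMPTION: non-robust estimates).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d' S^{-1} d, using the closed-form inverse of a 2x2 matrix.
        dists.append((syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det)
    return dists

def chi2_quantile_df2(p):
    """Quantile of the chi-square distribution with 2 degrees of freedom."""
    # For df = 2 the CDF is 1 - exp(-x / 2), which inverts in closed form.
    return -2.0 * math.log(1.0 - p)

def flag_outliers(points, p=0.975):
    """Indices of observations whose squared distance exceeds the cutoff."""
    cutoff = chi2_quantile_df2(p)
    return [i for i, d in enumerate(mahalanobis_sq_2d(points)) if d > cutoff]

# Simulated data (hypothetical): 200 bivariate normal points plus one gross outlier.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
data.append((8.0, -8.0))
flagged = flag_outliers(data)
print("flagged observations:", flagged)
```

With more than two variables the quantile has no such closed form and would be taken from a statistical library, and a robust estimator of location and scatter would reduce the influence of the outliers on the distances themselves.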

Fig 2. PFScores. Cutoff = -1.73 and -3.00. https://doi.org/10.1371/journal.pone.0299733.g002
Item development and first review. The Q-Herilearn scale is a probability scale of summative estimates that measures different aspects of the learning process in Heritage Education. It consists of the seven factors (Knowing, Understanding, Respecting, Valuing, Caring, Enjoying and Transmitting) defined in the introduction to this paper. Each dimension is measured by seven indicators scored on a 4-point frequency response scale (1 = Never or almost never; 2 = Sometimes; 3 = Quite often; 4 = Always or almost always).
: (a) item content should refer to the present; (b) item content should not refer to facts unrelated to the respondent; (c) item content should have only one interpretation; (d) item content should be relevant to the dimension it is intended to measure; (f) avoid extreme statements (i.e., statements that can be endorsed by almost everyone or almost no one); (g) items should cover the full range of each dimension;